[pull] master from ray-project:master #1069

Merged
pull[bot] merged 5 commits into garymm:master from ray-project:master
Jun 12, 2026

pull[bot] commented Jun 12, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

goutamvenkat-anyscale and others added 5 commits June 11, 2026 19:26
#63890)

obstore's S3Store defaults region to us-east-1 and does not follow AWS
PermanentRedirect responses, so any obstore-routed S3 request against a
bucket in a different region fails non-retryably with BareRedirect.

- `_split_obstore_uri` rewrites
`https://s3.<region>.amazonaws.com/<bucket>/<key>` to `s3://<bucket>` +
`<key>` so StoreRegistry can apply region discovery.
- `_discover_aws_bucket_region` resolves a bucket's region via
`pyarrow.fs.resolve_s3_region` (already a required Ray Data dependency),
cached per bucket. PyArrow issues the `x-amz-bucket-region` HEAD probe
and handles the legacy global endpoint / IMDS edge cases; we
additionally cache negative results so unresolvable buckets are probed
at most once. The probe runs outside the cache lock, and the write-back
never lets a `None` result overwrite a region a concurrent thread
already cached (a real region always wins), so racing first-time lookups
can't intermittently disable region injection.
- `StoreRegistry.get` injects the discovered region for `s3://` and
`s3a://` URLs, skipping injection when the caller already supplied a
region or a custom endpoint (MinIO/R2/etc.).
- All obstore call sites go through `_split_obstore_uri`: the HEAD size
probe (`_resolve_size`), the actor HEAD path (`_head_one`), ranged
downloads (`_fetch_ranged`), and whole-file GET (`_fetch`). A path-style
cross-region URL therefore no longer slips past the rewrite. Previously
the size probe stayed on the regional HTTPS store, returned size 0, and
wrongly skipped ranged downloads.
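
As a rough illustration of the URL rewrite described in the first bullet, here is a minimal sketch. The function name `split_path_style_s3_url`, the regular expression, and the return shape are assumptions made for this example; the actual `_split_obstore_uri` may handle more URL forms and return something different.

```python
import re
from typing import Optional, Tuple
from urllib.parse import urlparse

# Assumption: only hosts of the form s3.<region>.amazonaws.com are rewritten.
_REGIONAL_S3_HOST = re.compile(r"^s3\.([a-z0-9-]+)\.amazonaws\.com$")


def split_path_style_s3_url(url: str) -> Optional[Tuple[str, str]]:
    """Hypothetical helper: turn a path-style regional S3 URL into
    ("s3://<bucket>", "<key>"). Returns None for anything else."""
    parsed = urlparse(url)
    if parsed.scheme != "https":
        return None
    if not _REGIONAL_S3_HOST.match(parsed.hostname or ""):
        return None
    # Path looks like "/<bucket>/<key>"; the key may itself contain slashes.
    # The key is returned as-is (no percent-decoding) in this sketch.
    bucket, _, key = parsed.path.lstrip("/").partition("/")
    if not bucket:
        return None
    return f"s3://{bucket}", key
```

Dropping the region from the store URL is the point of the rewrite: once the request is addressed as `s3://<bucket>`, the region can be supplied by discovery instead of being pinned by the hostname.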

GCS and Azure are unaffected: neither encodes region in the endpoint
(GCS uses a global endpoint addressed by bucket name; Azure is keyed by
storage account), so they have no cross-region redirect failure mode.
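
The caching rules for S3 region discovery listed above (negative results cached, probe outside the lock, a real region never overwritten by `None`) can be sketched as follows. `BucketRegionCache` and its injected `probe` callable are hypothetical names for this illustration, not the code in the commit; in Ray the probe is described as going through `pyarrow.fs.resolve_s3_region`.

```python
import threading
from typing import Callable, Dict, Optional


class BucketRegionCache:
    """Hypothetical per-bucket region cache with negative caching."""

    def __init__(self, probe: Callable[[str], Optional[str]]):
        # probe(bucket) returns a region name, or None if unresolvable.
        self._probe = probe
        self._lock = threading.Lock()
        self._regions: Dict[str, Optional[str]] = {}

    def get(self, bucket: str) -> Optional[str]:
        with self._lock:
            if bucket in self._regions:
                # Hit, including a cached None (negative result).
                return self._regions[bucket]
        # The network probe runs outside the lock so slow lookups
        # for one bucket do not block lookups for others.
        region = self._probe(bucket)
        with self._lock:
            cached = self._regions.get(bucket)
            # Write back unless that would replace a real region,
            # cached by a concurrent thread, with None.
            if region is not None or cached is None:
                self._regions[bucket] = region
            return self._regions[bucket]
```

Two threads racing on a first-time lookup may both probe, but whichever order they finish in, a successful result is what stays in the cache.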

---------

Signed-off-by: Goutam <goutam@anyscale.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tream ray.get (#64014)

`_next_sync` documents *"if an object is not available within the given
timeout, it returns a nil object reference"*, but its end-of-stream
handling calls `ray.get(generator_ref)` with **no timeout** (to
distinguish a normal end of the stream from a task failure).

The bug: after all yielded refs are consumed, the next
`_next_sync(timeout_s=...)` call reaches that get. The generator's
return object normally resolves locally, but if it lives in plasma and
its node died, the get blocks the calling thread until lineage
reconstruction re-runs the task — which needs a free CPU. On a saturated
cluster this can deadlock: the blocked caller (e.g. Ray Data's
scheduling thread) is what consumes outputs and releases the CPUs held
by output-backpressured tasks, so reconstruction can never start.
Serve's `to_object_ref(timeout_s=...)` similarly blocks past the user's
requested timeout when a replica node dies.

- Apply the caller's `timeout_s` to the end-of-stream get; report a
timeout as a nil ref (retry), per the documented contract.
- `timeout_s=None` (and `-1`) keep the blocking behavior, so `__next__`
and other timeout-less callers are unchanged.
- Regression test: stream exhausted + return object lost with its node →
`_next_sync(timeout_s=0)` returns nil instead of blocking (hangs forever
without the fix), and the stream terminates normally once the node is
restored.
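
The timeout handling in the bullets above can be illustrated without a Ray cluster. The sketch below uses `concurrent.futures.Future` as a stand-in for the generator's return object; `end_of_stream_get`, `NIL`, and the `"done"` return value are hypothetical names for this example only. In Ray itself the analogous call would be `ray.get(ref, timeout=...)`, which raises `ray.exceptions.GetTimeoutError` on timeout.

```python
from concurrent.futures import Future
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Optional

NIL = None  # stand-in for a nil object reference ("not ready, retry")


def end_of_stream_get(return_obj: Future, timeout_s: Optional[float]):
    """Hypothetical sketch: wait for the stream's return object, but
    honor the caller's timeout instead of blocking unconditionally."""
    # None and -1 both mean "block until resolved", as before the fix.
    wait = None if timeout_s is None or timeout_s == -1 else timeout_s
    try:
        # Re-raises the task's exception if the task failed.
        return_obj.result(timeout=wait)
    except FutureTimeout:
        # Not resolved in time: report nil so the caller can retry.
        return NIL
    return "done"  # normal end of stream
```

With `timeout_s=0` the call returns immediately whether or not the return object is available, which is what lets a scheduling thread keep draining outputs instead of parking on a lost object.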

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
…64034)

The Windows base image build (ci/ray_ci/windows/build_base.sh) crashes
when running `conda update -c conda-forge ca-certificates certifi`:

    AttributeError: module 'lib' has no attribute 'X509_V_FLAG_CB_ISSUER_CHECK'

Upgrading the conda base env to python 3.10 (`conda install python=...`)
pulls cryptography>=38, which removed
`_lib.X509_V_FLAG_CB_ISSUER_CHECK`. pyopenssl is not part of that
transaction, so the stale py3.8-era pyopenssl is left behind and still
references the removed attribute at import. The next conda invocation
imports requests -> urllib3.contrib.pyopenssl -> OpenSSL.crypto and
detonates before conda can run, failing the base image build.

Co-resolve pyopenssl 23.2.0 in the same conda install transaction so it
stays compatible with the cryptography 38.x that gets installed.

---------

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… id (#64044)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
…in (#64021)

## Why are these changes needed?

`RAY_SERVE_PORT_QUARANTINE_S` holds a released direct-ingress replica
port out of the allocation pool so that stale routing state pointing at
the old replica drains before another replica can inherit the port. It
currently defaults to **10 seconds**.

The consumers that hold stale routing state the longest are soft-stopped
(reloaded-out) HAProxy worker processes: they run no health checks (see
[haproxy#3330](haproxy/haproxy#3330)) and keep
routing to their frozen server list until `hard-stop-after` fires —
**120s by default** (`RAY_SERVE_HAPROXY_HARD_STOP_AFTER_S`), commonly
configured higher. With the current defaults the quarantine is 12x
shorter than the window it exists to outlive: a freed port can be handed
to a *different app's* replica at +10s while old workers keep sending it
the previous app's traffic for up to +120s.

Observed in sustained load testing: a just-freed direct-ingress port was
recycled into another app's replica inside the stale-worker window, and
a soft-stopped worker routed the old app's traffic to it — surfacing as
unretried wrong-app 404s at the client. Health checks cannot catch this
(they validate the address is serving, not which app is serving).

## What does this change do?

Derives the default quarantine from the hard-stop window instead of a
fixed 10s:

```python
RAY_SERVE_PORT_QUARANTINE_S = get_env_float_non_negative(
    "RAY_SERVE_PORT_QUARANTINE_S",
    float(RAY_SERVE_HAPROXY_HARD_STOP_AFTER_S + 30),
)
```

The `+30s` margin covers the broadcast/coalesce/reload latency that
elapses before an old worker's hard-stop clock starts (the clock runs
from the worker's *orphaning* at reload, which can lag the port
release). An explicit `RAY_SERVE_PORT_QUARANTINE_S` still overrides, and
`0` still disables quarantining entirely.

Sizing rule this encodes (must hold for correctness, now holds by
default):

```
port quarantine >= hard-stop-after + reload propagation lag
```
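
To make the arithmetic concrete, here is a small standalone sketch of how the derived default behaves. `resolve_port_quarantine_s` is a hypothetical function written for this example; it mimics, but is not, Ray Serve's `get_env_float_non_negative` helper, and it takes the environment as a plain mapping so it can be exercised without touching the process environment.

```python
import os
from typing import Mapping, Optional

DEFAULT_HARD_STOP_AFTER_S = 120.0  # default stated in this PR description
RELOAD_MARGIN_S = 30.0             # the +30s propagation margin


def resolve_port_quarantine_s(env: Optional[Mapping[str, str]] = None) -> float:
    """Hypothetical sketch: explicit override wins, else hard-stop + margin."""
    env = os.environ if env is None else env
    hard_stop = float(
        env.get("RAY_SERVE_HAPROXY_HARD_STOP_AFTER_S", DEFAULT_HARD_STOP_AFTER_S)
    )
    raw = env.get("RAY_SERVE_PORT_QUARANTINE_S")
    if raw is None:
        return hard_stop + RELOAD_MARGIN_S
    value = float(raw)
    if value < 0:
        raise ValueError("RAY_SERVE_PORT_QUARANTINE_S must be non-negative")
    return value  # 0 disables quarantining entirely
```

Under these assumptions, an unset environment yields 150 seconds, and raising `RAY_SERVE_HAPROXY_HARD_STOP_AFTER_S` to 300 yields 330 seconds, so the sizing rule keeps holding when only the hard-stop window is tuned.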

Signed-off-by: harshit <harshit@anyscale.com>
Co-authored-by: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com>
pull[bot] locked and limited conversation to collaborators Jun 12, 2026
pull[bot] added the ⤵️ pull label Jun 12, 2026
pull[bot] merged commit d379709 into garymm:master Jun 12, 2026